Abstract 

We prove the correctness of a recently proposed cache coherence protocol, Tardis, which is simple 
yet scalable to high processor counts because it requires only O(log N) storage per cacheline for an 
N-processor system. We prove that Tardis follows the sequential consistency model and is both 
deadlock- and livelock-free. Our proof is based on simple and intuitive invariants of the system and thus applies 
to any system scale and many variants of Tardis. 


1 Introduction 


Tardis [37] is a recently proposed cache coherence protocol that is able to scale to a large number of cores. 
Unlike full-map directory protocols [7, 35], Tardis does not keep the O(N) (N is the core count) sharer 
information for each cacheline. In Tardis, only the owner ID of each cacheline (O(log N)) and logical 
timestamps (O(1)) for each cacheline are maintained. Unlike the snoopy bus coherence protocol or 
limited directory protocols such as ACKwise [20], Tardis does not broadcast messages to maintain coherence. 

In Tardis, each load or store is assigned a logical timestamp, which may not agree with the physical time. 
The global memory order then simply becomes the timestamp order, which is explicit in the protocol. This 
makes it much simpler to reason about the correctness of Tardis. Despite its simplicity, however, no proof of 
correctness has yet been published. We provide a simple and straightforward proof in Section 3; our proof 
is simpler than existing proofs for snoopy and directory protocols such as [32, 36]. 


In this paper, we formally prove the correctness of the Tardis protocol by showing that an execution of 
a program using Tardis strictly follows Sequential Consistency (SC). We also prove that the Tardis protocol 
can never deadlock or livelock. 


The original Tardis protocol [37] was designed for a shared memory multicore processor. A number of 
optimization techniques were applied for performance improvement. These optimizations, however, may not 
be desirable in other kinds of shared memory systems. Therefore, in this paper we first extract the core 
algorithm of Tardis and prove its correctness. We then focus on correctness of generalizations of the base 
protocol. 

We prove the correctness of Tardis by developing simple and intuitive system invariants. Compared to 
the popular model checking [12, 18, 28] verification techniques, our proof technique is able to scale to high 
processor counts. More importantly, the invariants we developed are more intuitive to system designers and 
thus provide more guidance for system implementation. 

The rest of the paper is organized as follows. The Tardis protocol is formally defined in Section 2. It 
is proven to obey sequential consistency in Section 3 and to be deadlock-free and livelock-free in Section 4. 
We then generalize the proofs to systems with main memory, describe related work, and conclude the paper. 
Figure 1: Architecture of a shared memory multicore processor. 


2 Tardis Coherence Protocol 

We first present the model of the shared memory system we use, along with our assumptions, in Section 2.1. 
Then, we introduce system components of the Tardis protocol in Section 2.2 and formally specify the protocol 
in Section 2.3. 

2.1 System Description 

Fig. 1 shows the architecture of a shared memory system based on which Tardis will be defined. The 
processors can execute instructions in-order or out-of-order but always commit instructions in order. A 
processor talks to the memory subsystem through a pair of local buffers (LB). Load and store requests are 
inserted into the memory request buffer (mRq) and responses from the memory subsystem are in the memory 
response buffer (mRp). 

We model a two-level memory subsystem with all the data fitting into the L2 caches. The network 
between L1 and L2 caches is modeled as buffers. c2pRq contains requests from L1 (child) to L2 (parent), 
c2pRp contains responses from L1 to L2, and p2c contains messages (both requests and responses) from L2 
to L1. For simplicity, all the buffers are modeled as FIFOs and get_msg() returns the head message in the 
buffer. However, the protocol also works if the buffers only have the FIFO property for the same address. 
Each L1 cache has a unique id from 1 to N and each associated buffer has the same id. An L1 cacheline or 
a message in a buffer has the same id as the L1 cache or the buffer it is in. 
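The buffer interface used by the protocol rules can be sketched in Python as follows. This is only an illustrative model: the class name `Fifo`, the `capacity` parameter and the `full`/`empty` helpers are assumptions introduced here, while `enq`, `deq` and `get_msg` follow the names used in the text.

```python
from collections import deque


class Fifo:
    """Bounded FIFO buffer; get_msg() peeks at the head without removing it."""

    def __init__(self, capacity=4):  # capacity is an assumed parameter
        self.capacity = capacity
        self.items = deque()

    def full(self):
        return len(self.items) >= self.capacity

    def empty(self):
        return not self.items

    def enq(self, msg):
        # An enqueuing rule is only allowed to fire when there is room.
        if self.full():
            raise RuntimeError("buffer full: rule cannot fire")
        self.items.append(msg)

    def get_msg(self):
        # Head message, left in place.
        return self.items[0]

    def deq(self):
        # Remove and return the head message.
        return self.items.popleft()
```

One such buffer per L1 id would model c2pRq, c2pRp and p2c; messages can be plain tuples in the formats listed in Table 1.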

2.2 System Components in Tardis 

The Tardis protocol is built around the concept of logical timestamps. Each memory operation in Tardis has 
a corresponding timestamp which indicates its global memory order. The memory dependency is expressed 
using timestamps and no sharer information is maintained. For simplicity, we assume the timestamps to be 
large enough that they never overflow (e.g., 64 bits). Timestamp compression algorithms are able to achieve 
much smaller timestamps (e.g., 16 bits) in practice [37]. 

At a high level, if a load observes the value of a previous store, then the load should be ordered after 
the store in logical time (and thus global memory order). Similarly, a store should be ordered after a 
load if the load does not observe the stored value. To keep track of the timestamps of each operation, every 
cacheline in Tardis has a read timestamp (rts) and a write timestamp (wts). The wts is the timestamp of 
the last store and rts is the timestamp of the last (potential) load to the cacheline ($wts \le rts$). Similar to a 
directory protocol, a cacheline can be cached in L1 in either M or S states. Only one L1 can obtain the M 
state at any time to modify the cacheline, and multiple L1s can share the line in the S state. Timestamps 
are also required for messages in the buffers. Table 1 summarizes the format of caches and buffers modeled 
in the system. The differences between Tardis and a directory protocol are highlighted in red. 

An L1 cacheline contains five fields: state, data, busy, wts and rts. The state can be M, S or I. For ease 
of discussion, we define a total ordering among the three states I < S < M. A cacheline has busy = True if 
a request to L2 is outstanding. This prevents duplicated requests. An L2 cacheline contains one more field 
Table 1: System Components 

Component   Format                                                Message Types 
L1          L1[addr] = (state, data, busy, wts, rts)              - 
L2          L2[addr] = (state, data, busy, owner, wts, rts)       - 
mRq         mRq.entry = (type, addr, data, pts)                   LdRq, StRq 
mRp         mRp.entry = (type, addr, data, pts)                   LdRp, StRp 
c2pRq       c2pRq.entry = (id, type, addr, pts)                   GetS, GetM 
c2pRp       c2pRp.entry = (id, addr, data, wts, rts)              WBRp 
p2c         p2c.entry = (id, msg, addr, state, data, wts, rts)    ToS, ToM, WBRq 


Table 2: State Transition Rules for L1. 

LoadHit 
  Rules and Condition: 
    let (type, addr, _, pts) = mRq.get_msg() 
    let (state, data, busy, wts, rts) = L1[addr] 
    condition: $\neg busy \land type = S \land (state = M \lor (state = S \land pts \le rts))$ 
  Action: 
    mRq.deq() 
    mRp.enq(type, addr, data, max(pts, wts)) 
    If (state = M) 
      rts := max(pts, rts) 

StoreHit 
  Rules and Condition: 
    let (type, addr, data, pts) = mRq.get_msg() 
    let (state, data, busy, wts, rts) = L1[addr] 
    condition: $\neg busy \land type = state = M$ 
  Action: 
    let pts' = max(pts, rts + 1) 
    mRq.deq() 
    mRp.enq(type, addr, _, pts') 
    wts := pts' 
    rts := pts' 

L1Miss 
  Rules and Condition: 
    let (type, addr, data, pts) = mRq.get_msg() 
    let (state, data, busy, wts, rts) = L1[addr] 
    condition: $\neg busy \land (state < type \lor (state = type = S \land pts > rts))$ 
  Action: 
    c2pRq.enq(id, type, addr, pts) 
    busy := True 

L2Resp 
  Rules and Condition: 
    let (id, msg, addr, state, data, wts, rts) = p2c.get_msg() 
    let (l1state, l1data, busy, l1wts, l1rts) = L1[addr] 
    condition: msg = Resp 
  Action: 
    p2c.deq() 
    l1state := state 
    l1data := data 
    busy := False 
    l1wts := wts 
    l1rts := rts 

Downgrade 
  Rules and Condition: 
    let (state, data, busy, wts, rts) = L1[addr] 
    condition: $\neg busy \land \exists state'.\ state' < state$ 
               $\land$ LoadHit and StoreHit cannot fire 
  Action: 
    If (state = M) 
      c2pRp.enq(id, addr, data, wts, rts) 
    state := state' 

WriteBackReq 
  Rules and Condition: 
    let (state, data, busy, wts, rts) = L1[addr] 
    condition: p2c.get_msg().msg = Req 
               $\land$ LoadHit and StoreHit cannot fire 
  Action: 
    p2c.deq() 
    If (state = M) 
      c2pRp.enq(id, addr, data, wts, rts) 
      state := S 


owner, which is the id of the L1 that exclusively owns the cacheline in the M state. As in L1, busy in L2 is 
set when a write back request (WBRq) to an L1 is outstanding. 

Each entry in mRq contains four fields: type, addr, data and pts. The type can be S or M, corresponding 
to a load or store request, respectively. The pts is a timestamp specified by the processor and the timestamp 
of the memory operation should be no less than pts. mRp has the same format as mRq, but pts here is the 
actual timestamp of the memory operation. 

The format of the messages in the network buffers (c2pRq, c2pRp and p2c) is shown in Table 1, where 
the meanings of the fields are as in the cachelines or the messages in mRq. All network messages have a field 
id, which is the id of the L1 cache that the message comes from or goes to. Messages in p2c have a field 
msg, which can be either Req or Resp; p2c may contain both requests and responses from L2 to L1, and msg 
differentiates the two types. 

2.3 Protocol Specification 

We now formally define the core algorithm of the Tardis protocol. The state transition rules for L1 and 
L2 caches are summarized in Table 2 and Table 3, respectively, with the differences between Tardis and a 
directory protocol highlighted in red. For all rules where a message is enqueued to a buffer, the rule can 
only fire if the buffer is not full. 
Table 3: State Transition Rules for L2. 

ShReq_S 
  Rules and Condition: 
    let (id, type, addr, pts) = c2pRq.get_msg() 
    let (state, data, busy, owner, wts, rts) = L2[addr] 
    condition: $type = S \land state = S \land \exists pts'.\ pts' \ge rts \land pts' \ge pts$ 
  Action: 
    c2pRq.deq() 
    rts := pts' 
    p2c.enq(id, Resp, addr, S, data, wts, pts') 

ExReq_S 
  Rules and Condition: 
    let (id, type, addr, pts) = c2pRq.get_msg() 
    let (state, data, busy, owner, wts, rts) = L2[addr] 
    condition: $type = M \land state = S$ 
  Action: 
    c2pRq.deq() 
    state := M 
    owner := id 
    p2c.enq(id, Resp, addr, M, data, wts, rts) 

Req_M 
  Rules and Condition: 
    let (id, type, addr, pts) = c2pRq.get_msg() 
    let (state, data, busy, owner, wts, rts) = L2[addr] 
    condition: $state = M \land \neg busy$ 
  Action: 
    p2c.enq(owner, Req, addr, _) 
    busy := True 

WriteBackResp 
  Rules and Condition: 
    let (id, addr, data, wts, rts) = c2pRp.get_msg() 
    let (state, l2data, busy, owner, l2wts, l2rts) = L2[addr] 
  Action: 
    c2pRp.deq() 
    state := S 
    l2data := data 
    busy := False 
    l2wts := wts 
    l2rts := rts 


An important concept in Tardis is the lease. A cacheline shared in an L1 cache also carries a lease, 
which expires at the current rts. The data is only valid if the lease has not expired, i.e., pts from the request 
is less than or equal to rts. The rts in the L2 cache is the maximum of the rts of all the sharing L1 caches. 
When a shared cacheline in the L2 cache gets a GetM request, it does not send invalidations as in a directory 
protocol; rather, it immediately returns exclusive ownership to the requesting processor, which will jump 
ahead in logical time and perform the store no earlier than rts + 1. If we consider logical time, the store happens after 
all the loads that do not observe its value, although in physical time, the store and the loads may happen 
simultaneously. 

Specifically, the following six transition rules may fire in an L1 cache. 

1. LoadHit. LoadHit can fire if the requested cacheline is in the M state or is in the S state and the 
lease has not expired. If it is in the M state, then rts is updated to reflect the latest load timestamp. The 
pts returned to the processor is no less than the cacheline’s wts, which orders the load after the previous 
store in logical time. 

2. StoreHit. StoreHit can only fire if the requested cacheline is in the M state in the L1 cache. Both 
wts and rts are updated to the timestamp of the store operation, which is at least rts + 1. The store is thus 
logically ordered after all concurrent loads on the same address in other L1s. 

3. L1Miss. If neither LoadHit nor StoreHit can fire for a request and the cacheline is not busy, it is an 
L1 miss and the request (GetS or GetM) is forwarded to the L2 cache. The cacheline is then marked as busy 
to prevent sending duplicated requests. 

4. L2Resp. A response from L2 sets all the fields in the L1 cacheline. The busy flag is reset to False 
and the cacheline can serve the next request in the mRq. 

5. Downgrade. A cacheline in the M or S states may downgrade if the cacheline is not busy and 
LoadHit and StoreHit cannot fire. For an M to S or I downgrade, the cacheline should be written back to the 
L2 in a WBRp message. An S to I downgrade, however, is silent, i.e., no message is sent. 

6. WriteBackReq. When a cacheline in M state receives a write back request, the cacheline is returned 
to L2 in a WBRp message and the L1 state becomes S. If the request is to a cacheline in S or I state, the 
request is simply ignored. This corresponds to the case where the line self-downgrades after the write back 
request (WBRq) is sent from the L2. 
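The timestamp handling of the LoadHit, StoreHit and L1Miss rules can be sketched in Python as below. This is a minimal sketch for a single cacheline, not the authors' implementation: the `L1Line` class, the function names and the integer encoding of the states are assumptions made for illustration, and writing `value` into `data` on a store is implied by the text rather than listed among the actions in Table 2.

```python
from dataclasses import dataclass

I, S, M = 0, 1, 2  # assumed encoding of the total order I < S < M


@dataclass
class L1Line:
    state: int = I
    data: int = 0
    busy: bool = False
    wts: int = 0
    rts: int = 0


def load_hit(line, pts):
    """Return (data, timestamp) if LoadHit can fire, else None."""
    if line.busy:
        return None
    if line.state == M or (line.state == S and pts <= line.rts):
        ts = max(pts, line.wts)  # the load is ordered after the last store
        if line.state == M:
            line.rts = max(pts, line.rts)
        return line.data, ts
    return None


def store_hit(line, pts, value):
    """Return the store timestamp if StoreHit can fire, else None."""
    if line.busy or line.state != M:
        return None
    ts = max(pts, line.rts + 1)  # jump past every outstanding lease
    line.data = value            # assumed: the store writes its data here
    line.wts = line.rts = ts
    return ts


def l1_miss(line, req_type, pts):
    """True if the L1Miss condition holds for a request of type S or M."""
    if line.busy:
        return False
    return line.state < req_type or (line.state == req_type == S and pts > line.rts)
```

For a shared line with wts = 3 and rts = 10, a load at pts = 7 hits with timestamp 7, while a load at pts = 11 finds the lease expired and satisfies the L1Miss condition instead.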

The following four rules may fire in the L2 cache. 

7. ShReq_S. When a cacheline in the S state receives a shared request (i.e., GetS), both the rts and 
the returned pts are set to pts', which can be any timestamp greater than or equal to the current rts and 
pts. The pts' indicates the end of the lease for the cacheline, and the cacheline may be loaded at any logical 
time between wts and pts'. The returned message is a ToS message. 
8. ExReq_S. When a cacheline in the S state receives an exclusive request (i.e., GetM), the cacheline 
is instantly returned in a ToM message. Unlike in a directory protocol, no invalidations are sent to the 
sharers. The sharing processors may still load their local copies of the cacheline and such loads have smaller 
timestamps than the store from the new owner processor. 

9. Req_M. When a cacheline in the M state receives a request and is not busy, a write back request 
(i.e., WBRq) is sent to the current owner, and busy is set to True to prevent sending duplicated requests. 

10. WriteBackResp. Upon receiving a write back response (i.e., WBRp), data and timestamps are 
written to the L2 cacheline. The state becomes S and busy is reset to False. 
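The L2 side of the protocol can be sketched in the same spirit. The code below is an illustrative Python model of ShReq_S, ExReq_S and WriteBackResp for one cacheline. The `L2Line` class, the function names and the fixed `lease` increment are assumptions: the protocol only requires that pts' be at least both rts and pts, and choosing `max(rts, pts) + lease` is one possible policy.

```python
from dataclasses import dataclass

S, M = "S", "M"


@dataclass
class L2Line:
    state: str = S
    data: int = 0
    busy: bool = False
    owner: int = -1
    wts: int = 0
    rts: int = 0


def sh_req_s(line, pts, lease=10):
    """ShReq_S: extend the lease and return a ToS response."""
    assert line.state == S
    new_rts = max(line.rts, pts) + lease  # any pts' >= rts and >= pts is legal
    line.rts = new_rts
    return ("ToS", line.data, line.wts, new_rts)


def ex_req_s(line, requester):
    """ExReq_S: hand over ownership at once, without invalidating sharers."""
    assert line.state == S
    line.state, line.owner = M, requester
    return ("ToM", line.data, line.wts, line.rts)


def write_back_resp(line, data, wts, rts):
    """WriteBackResp: absorb the owner's data and timestamps."""
    line.state, line.data, line.busy = S, data, False
    line.wts, line.rts = wts, rts
```

If a sharer is granted a lease up to rts = 13 and another core then obtains ownership, the new owner's first store lands at timestamp 14 or later, so every load served from the shared copy is ordered before it in logical time.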


3 Proof of Correctness 


We now prove the correctness of the Tardis protocol specified in Section 2.3 by proving that it strictly follows 
sequential consistency. We first give the definition of sequential consistency in Section 3.1 and then introduce 
the basic lemma (Section 3.2) and timestamp lemmas (Section 3.3) that are used for the correctness proof. 

Most of the lemmas and theorems in the rest of the paper are proven through induction. For each lemma 
or theorem, we first prove that it is true for the initial system state (base case) and then prove that it is still 
true after any possible state transition assuming that it was true before the transition. 

In the initial system state, all the L1 cachelines are in the I state, all the L2 cachelines are in the S 
state and all the network buffers are empty. For all cachelines in L1 or L2, wts = rts = 0 and busy = False. 
Requests from the processors may exist in the mRq buffers. For ease of discussion, we assume that each 
initial value in L2 was set before the system starts at timestamp 0 through a store operation. 


3.1 Sequential Consistency 

According to Lamport [22], a parallel program is sequentially consistent if “the result of any execution is 
the same as if the operations of all processors were executed in some sequential order, and the operations of 
each individual processor appear in this sequence in the order specified by its program”. Using $<_m$ and $<_p$ 
to represent the global memory order and program order per processor respectively, sequential consistency 
(SC) is defined as follows [34]. 

Definition 1 (Sequential Consistency). An execution of a program is sequentially consistent iff 

Rule 1: $\forall X, Y \in \{Ld, St\}$ from the same processor, $X <_p Y \Rightarrow X <_m Y$. 

Rule 2: Value of $L(a)$ = Value of $\mathrm{Max}_{<_m}\{S(a) \mid S(a) <_m L(a)\}$, where $L(a)$ and $S(a)$ are a load and a 
store to address $a$, respectively, and $\mathrm{Max}_{<_m}$ selects the most recent operation in the global memory order. 

In Tardis, the global memory order of sequential consistency is expressed using timestamps. Specifically, 
Theorem 1 states the invariants in Tardis that correspond to the two rules of sequential consistency. Here, 
we use $<_{ts}$ and $<_{pt}$ to represent (logical) timestamp order that is assigned by Tardis and physical time order 
that represents the order of events, respectively. 

Theorem 1 (SC on Tardis). An execution on Tardis has the following invariants. 

Invariant 1: Value of $L(a)$ = Value of $\mathrm{Max}_{<_{ts}}\{S(a) \mid S(a) \le_{ts} L(a)\}$. 

Invariant 2: $\forall S_1(a), S_2(a),\ S_1(a) \ne_{ts} S_2(a)$. 

Invariant 3: $\forall S(a), L(a),\ S(a) =_{ts} L(a) \Rightarrow S(a) <_{pt} L(a)$. 

Theorem 1 itself is not enough to guarantee sequential consistency; we also need the processor model 
described in Definition 2. The processor should commit instructions in the program order, which implies 
physical time order and monotonically increasing timestamp order. Both in-order and out-of-order processors 
fit this model. 

Definition 2 (In-order Commit Processor). $\forall X, Y \in \{Ld, St\}$ from the same processor, 
$X <_p Y \Rightarrow X \le_{ts} Y \land X <_{pt} Y$. 

Now we prove that given Theorem 1 and our processor model, an execution obeys sequential consistency 
per Definition 1. We first introduce the following definition of the global memory order in Tardis. 

Definition 3 (Global Memory Order in Tardis). 

$X <_m Y \triangleq X <_{ts} Y \lor (X =_{ts} Y \land X <_{pt} Y)$. 

Theorem 2. Tardis with in-order commit processors implements Sequential Consistency. 

Proof. According to Definitions 2 and 3, $X <_p Y \Rightarrow X \le_{ts} Y \land X <_{pt} Y \Rightarrow X <_m Y$. So Rule 1 in Definition 
1 is obeyed. 

$S(a) \le_{ts} L(a) \Rightarrow S(a) <_{ts} L(a) \lor S(a) =_{ts} L(a)$. By Invariant 3 in Theorem 1, this implies 
$S(a) \le_{ts} L(a) \Rightarrow S(a) <_{ts} L(a) \lor (S(a) =_{ts} L(a) \land S(a) <_{pt} L(a))$. Thus, from Definition 3, 
$S(a) \le_{ts} L(a) \Rightarrow S(a) <_m L(a)$. We also have $S(a) <_m L(a) \Rightarrow S(a) \le_{ts} L(a)$ from Definition 3. So 
$\{S(a) \mid S(a) \le_{ts} L(a)\} = \{S(a) \mid S(a) <_m L(a)\}$. According to Invariant 2 in Theorem 1, all the elements in 
$\{S(a) \mid S(a) <_m L(a)\}$ have different timestamps, which means $<_m$ and $<_{ts}$ indicate the same ordering. 
Finally, Invariant 1 in Theorem 1 becomes Rule 2 in Definition 1. □ 
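The global memory order of Definition 3 amounts to a lexicographic sort on (timestamp, physical time). The Python sketch below illustrates this on a small hand-made trace and checks Rule 2 of Definition 1 against it. The `Op` record, the function names and the trace values are assumptions chosen for illustration; the sketch also assumes that physical times are distinct and that memory initially holds 0.

```python
from typing import NamedTuple


class Op(NamedTuple):
    kind: str   # "Ld" or "St"
    addr: str
    value: int  # value stored, or value returned by the load
    ts: int     # logical timestamp assigned by the protocol
    pt: int     # physical time at which the operation happened


def global_order(ops):
    """Definition 3: order by timestamp, break ties by physical time."""
    return sorted(ops, key=lambda op: (op.ts, op.pt))


def obeys_rule2(ops, initial=0):
    """Each load returns the value of the most recent earlier store in <m."""
    last = {}
    for op in global_order(ops):
        if op.kind == "St":
            last[op.addr] = op.value
        elif op.value != last.get(op.addr, initial):
            return False
    return True


trace = [
    Op("St", "a", 1, ts=11, pt=5),
    Op("Ld", "a", 0, ts=7, pt=6),   # later in physical time, earlier in logical time
    Op("Ld", "a", 1, ts=11, pt=8),  # same timestamp as the store, later physically
]
```

In this trace the load at physical time 6 still reads the old value 0. That is legal because its timestamp 7 places it before the store at timestamp 11 in the global memory order.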

In the following two sections, we focus on the proof of Theorem 1. 

3.2 Basic Lemma 

We first give the definition of a clean block for ease of discussion. 

Definition 4 (Clean Block). A clean block can be an L2 cacheline in S state, or an L1 cacheline in M state, 
or a ToM or WBRp message in a network buffer. 

Lemma 1. $\forall$ address $a$, at most one clean block exists. 

The basic lemma is an invariant about the cacheline states and the messages in network buffers. No 
timestamps are involved. 

A visualization of Lemma 1 is shown in Fig. 2, where a solid line represents a clean block. When the L2 
state for an address is S, no L1 can have that address in M state, and no ToM and 
WBRp may exist. Otherwise, if the L2 state is M, either a ToM response exists, or an L1 has the address in 
M state, or a WBRp exists. Intuitively, Lemma 1 says only one block in the system can represent the latest 
data value. 

Lemma 1 Proof. For the base case, the lemma is trivially true since exactly one clean block exists for each 
address and the block is an L2 cacheline in S state. We now consider all the possible transition rules that 
may create a new clean block. 

Only the ExReq_S rule can create a ToM response. However, the rule changes the state of the L2 cacheline 
from S to M and thus removes a clean block. Only the L2Resp rule can change an L1 cacheline state to M. 
However, it removes a ToM response from the p2c buffer. Both Downgrade and WriteBackReq can enqueue 
WBRp messages and both will change the L1 cacheline state from M to S or I. Only WriteBackResp changes 
the L2 cacheline to S state but it also dequeues a WBRp from the buffer. 

In all of these transitions, a clean block is created and another one is removed. By the induction 
hypothesis, at most one clean block per address exists before the current transition, and still at most one 
clean block for the address exists after the transition. For other transitions not listed above, no clean block 
can be created so at most one clean block per address exists after any transition, proving the lemma. □ 
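Lemma 1 can be read as a check on a snapshot of the system state for one address. The Python sketch below counts clean blocks according to Definition 4. The function name and the snapshot encoding (state letters and message type names as strings) are assumptions for illustration only.

```python
def count_clean_blocks(l2_state, l1_states, messages):
    """Count clean blocks for one address (Definition 4).

    l2_state:  "S" or "M"
    l1_states: list of "I", "S" or "M", one entry per L1 cache
    messages:  type names of in-flight messages for the address
    """
    count = 1 if l2_state == "S" else 0                        # L2 cacheline in S
    count += sum(1 for s in l1_states if s == "M")             # L1 cacheline in M
    count += sum(1 for m in messages if m in ("ToM", "WBRp"))  # ToM or WBRp in a buffer
    return count
```

Following one cacheline through ExReq_S, L2Resp, WriteBackReq and WriteBackResp, the clean block moves from the L2 cacheline to a ToM message, then to an L1 cacheline, then to a WBRp message and back to L2, and the count stays at one in every snapshot.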

3.3 Timestamp Lemmas 

Lemma 2. At the current physical time, a clean block has the following invariants. 

Invariant 1: Its rts is no less than the rts of all the other blocks (in caches and messages) with respect 
to the same address. 

Invariant 2: Up to the current physical time, no store has happened to the address at timestamp ts such 
that ts > rts. 


Figure 2: A visualization of Lemma 1 (clean blocks: the L2 cacheline while L2 state = S; a ToM response, an L1 cacheline with state = M, or a WBRp while L2 state = M). 
Proof. We prove the lemma by induction on the transition sequence. For the base case, only one block exists 
per address so Invariant 1 is true. All the stores so far happened at timestamp 0, which equals the rts of all 
the clean blocks, proving Invariant 2. 

According to Lemma 1, for an address, exactly one clean block exists. By the induction hypothesis, if 
no timestamp changes and no clean block is generated, Invariant 1 is still true after the transition. By the 
transition rules, wts or rts can only be increased if the block is an L2 cacheline in the S state or an L1 
cacheline in the M state. In both cases the block is clean. After the transition, the rts of the clean block 
increases and is still no less than the rts of other cachelines with the same address. 

Similarly, by the induction hypothesis, Invariant 2 is true after the transition if no store happens and no 
clean block is generated. Only StoreHit can perform a store to a clean block, which changes both wts and rts 
to max(pts, rts + 1). For the stored cacheline, since no store has happened with timestamp ts such that 
ts > old_rts (induction hypothesis), after the transition, no store, including the current one, has happened 
with timestamp ts such that ts > max(pts, old_rts + 1) > old_rts. 

Consider the last case where a clean block is generated at the current transition. Here, according to 
Lemma 1, another clean block disappears. In all the transitions, the rts of the new clean block equals the 
rts of the old block. Thus, both invariants are still true after the transition. □ 

Lemma 3. For any block B in a cache or a message (WBRp, ToS and ToM), the data value associated with 
the block comes from a store St which has happened before the current physical time, and no other store St' 
has happened such that $St.ts < St'.ts \le B.rts$, where St.ts is the timestamp of the store St and B.rts is the 
rts of block B. 

Proof. We prove the lemma by induction on the transition sequence. For the base case, each block has a 
corresponding store which happened before the system started and is thus before the current physical time. 
It is also the only store that has happened per address. Therefore the hypothesis is true. 

We first prove that after a transition, for each block, there exists a store St which creates the data of 
the block and St happened before the current physical time. Consider the case where the data of a block 
does not change or comes from another block. Then, by the induction hypothesis, St must exist for the 
block after the transition. The only transition that changes the data of a block is StoreHit. After the store, 
however, the data stored in the cacheline comes from the current store, which has just happened in physical 
time. 

We now prove the second part of the lemma, that for any block B, no store $St' \ne St$ has happened such 
that $St.ts < St'.ts \le B.rts$. By the induction hypothesis, for the current transition, if no data or rts is 
changed in any block or if a block copies data and rts from an existing block, then the hypothesis is still true 
after the transition. The only cases in which the hypothesis may be violated are when the current transition 
changes rts or data for some block, which is only possible for LoadHit, StoreHit and ShReq_S. 

For LoadHit, if the cacheline is in the S state, then rts remains unchanged. Otherwise, the cacheline must 
be a clean block, in which case rts is increased. Similarly, ShReq_S increases the rts and the cacheline must 
be a clean block again. By Invariant 2 in Lemma 2, no store has happened to the address with timestamp 
greater than rts. Thus, after the rts is increased, no store can have happened with timestamp between 
the old rts and the new rts. By the induction hypothesis, we also have that no store St' could have happened 
such that $St.ts < St'.ts \le old\_rts$. These two inequalities together prove the hypothesis. 

For StoreHit, both rts and data are modified. For the stored cacheline, after the transition, St.ts = wts 
= rts = max(pts, old_rts + 1). Thus, no St' may exist in this situation. For all the other cachelines, by 
Invariant 1 in Lemma 2, their rts is no greater than the old rts of the stored cacheline and is thus smaller 
than the timestamp of the current store. By the induction hypothesis, no store St' exists for those blocks 
before the transition. Thus, in the overall system, no such store St' can exist, proving the lemma. □ 

Finally, we prove Theorem 1. 

Theorem 1 Proof. According to Lemma 3, for each $L(a)$, the loaded data is provided by an $S(a)$ and no 
other store $S'(a)$ has happened between the timestamp of $S(a)$ and the rts. Thus, no $S'(a)$ has happened 
between the timestamp of $S(a)$ and the timestamp of the load, which is no greater than rts by the transition 
rules. Therefore, Invariant 1 in Theorem 1 is true. 

By the transition rules, a new store can only happen to a clean block and the timestamp of the store is 
max(pts, rts + 1). By Invariant 2 in Lemma 2, for a clean block at the current physical time, no store to the 
same address has happened with timestamp greater than the old rts of the cacheline. Therefore, for each 
new store, no store to the same address so far has the same timestamp as the new store, because the new 
store’s timestamp is strictly greater than the old rts. Thus, no two stores to the same address may have 
the same timestamp, proving Invariant 2. 

Finally, we prove Invariant 3. If $S(a) =_{ts} L(a)$, by Invariant 1 in Theorem 1, $L(a)$ returns the data stored 
by $S(a)$. Then by Lemma 3, the store $S(a)$ must have happened before $L(a)$ in the physical time. □ 
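Invariants 2 and 3 of Theorem 1 are simple enough to check mechanically on a recorded trace. The following Python sketch does so. The tuple layout `(kind, addr, ts, pt)` and the function name are assumptions for illustration, and the check is a brute-force test on a finite trace, not a substitute for the proof.

```python
from itertools import combinations


def check_invariants(ops):
    """ops: list of (kind, addr, ts, pt) tuples with kind in {"Ld", "St"}."""
    stores = [o for o in ops if o[0] == "St"]
    loads = [o for o in ops if o[0] == "Ld"]

    # Invariant 2: two stores to the same address never share a timestamp.
    for s1, s2 in combinations(stores, 2):
        if s1[1] == s2[1] and s1[2] == s2[2]:
            return False

    # Invariant 3: a load with the same timestamp as a store to the same
    # address must come after that store in physical time.
    for s in stores:
        for ld in loads:
            if s[1] == ld[1] and s[2] == ld[2] and not s[3] < ld[3]:
                return False
    return True
```

A trace in which a load at timestamp 11 precedes, in physical time, a store to the same address at timestamp 11 fails the check, which is the situation Invariant 3 rules out.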


4 Deadlock and Livelock Freedom 


In this section, we prove that the Tardis protocol specified in Section 2 is both deadlock-free (Section 4.1) 
and livelock-free (Section 4.2). 


4.1 Deadlock Freedom 

Theorem 3 (Deadlock Freedom). After any sequence of transitions, if there is a pending request from any 
processor, then at least one transition rule (other than the Downgrade rule) can fire. 

Before proving the theorem, we first introduce and prove several lemmas. 

Lemma 4. If an L1 cacheline is busy, either a GetS or GetM request exists in its c2pRq buffer or a ToS or 
ToM response exists in its p2c buffer. 

Proof. This can be proven through induction on the transition sequence. In the base case, all the L1 
cachelines are non-busy and the hypothesis is true. An L1 cacheline can only become busy through the 
L1Miss rule, which enqueues a request to its c2pRq buffer. A request can only be dequeued from c2pRq 
through the ShReq_S or ExReq_S rule, which enqueues a response into the same L1’s p2c buffer. Finally, 
whenever a message is dequeued from the p2c buffer (L2Resp rule), the L1 cacheline becomes non-busy, 
proving the lemma. □ 

Lemma 5. If an L2 cacheline is busy, the cacheline must be in state M. 

Proof. This lemma can be proven by induction on the transition sequence. For the base case, no cachelines 
are busy and the hypothesis is true. Only Req_M makes an L2 cacheline busy but the cacheline must be 
in the M state. Only WriteBackResp downgrades an L2 cacheline from the M state but it also makes the 
cacheline non-busy. □ 

Lemma 6. For an L2 cacheline in the M state, the id of the clean block for the address equals the owner of 
the L2 cacheline. 

Proof. According to Lemma 1, exactly one clean block exists for the address. If the L2 state is M, the clean 
block can be a ToM response, an L1 cacheline in the M state or a WBRp. We prove the lemma by induction 
on the transition sequence. 

The base case is true since no L2 cachelines are in the M state. We only need to consider cases wherein 
a clean block is created. When ToM is created (ExReq_S rule), its id equals the owner in the L2 cacheline. 
When an L1 cacheline in the M state is created (L2Resp rule), its id equals the id of the ToM response. 
When a WBRp is created (WriteBackReq or Downgrade rule), its id equals the id of the L1 cacheline. By 
the induction hypothesis, the id of a newly created clean block always equals the owner in the L2 cacheline, 
which does not change as long as the L2 cacheline is in the M state. □ 

Lemma 7. For a busy cacheline in L2, either a WBRq or a WBRp exists for the address with id matching 
the owner of the L2 cacheline. 



Proof. We prove the lemma by induction on the transition sequence. For the base case, no cacheline is busy 
and thus the hypothesis is true. We only need to consider the cases where an L2 cacheline is busy after the 
current transition, i.e., the transitions $\neg busy \to busy$ and $busy \to busy$. 

Only the Req_M rule can cause a $\neg busy \to busy$ transition and the rule enqueues a WBRq into p2c with 
id matching the owner and therefore the hypothesis is true. 

For $busy \to busy$, the lemma can only be violated if a WBRq or WBRp with matching id is dequeued. 
However, when a WBRp is dequeued, the cacheline becomes non-busy in L2 (WriteBackResp rule). If a 
WBRq is dequeued and the L1 cacheline is in the M state, a WBRp is created with a matching id. So the 
only case to consider is when the WBRq with matching id is dequeued, and the L1 cacheline is in the S or 
I states, and no other WBRq exists in the same p2c buffer and no WBRp exists in the c2pRp buffer. 

The L2 cacheline can only become busy by sending a WBRq. The fact that the dequeued WBRq is the 
only WBRq in the p2c buffer means that the L2 cacheline has been busy since the dequeued WBRq was sent 
(otherwise another WBRq would have been sent when the L2 cacheline became busy again). Since p2c is a FIFO, 
when the WBRq is dequeued, the messages in the p2c must have been sent after the WBRq was sent. By the transition 
rules, the L2 cacheline cannot send ToM while being busy, so no ToM may exist in the p2c buffer when the 
WBRq is dequeued. As a result, no clean block exists with id = owner. Then, by Lemma 6, no clean block 
exists for the address (L2 is in the M state because of Lemma 5), which contradicts Lemma 1. □ 

Theorem 3 Proof. If any message exists in the c2pRp buffer, the WriteBackResp rule can fire. Consider the 
case where no message exists in the c2pRp buffer. If any message is at the head of the p2c buffer, the L2Resp rule 
can fire, or the WriteBackReq, LoadHit or StoreHit rule can fire. For the theorem to be violated, no messages 
can exist in the c2pRp or p2c buffer. Then, according to Lemma 7, all cachelines in L2 are non-busy. 

Now consider the case when no message exists in the c2pRp buffer or p2c buffer and a GetS or GetM request 
exists in c2pRq for an L1 cache. Since the L2 is not busy, one of ShReq_S, ExReq_S and Req_M can fire, 
which enqueues a message into the p2c buffer. 

Consider the last case where there is no message in any network buffer. By Lemma 4, all L1 cachelines 
are non-busy. By the hypothesis, there must be a request in mRq for some processor. Now if the request is a 
hit, the corresponding hit rule (LoadHit or StoreHit) can fire. Otherwise, the L1Miss rule can fire, sending 
a message to c2pRq. □ 

4.2 Livelock Freedom 

Even though the Tardis protocol correctly follows sequential consistency and is deadlock-free, livelock may
still occur if the protocol is not well designed. For example, for an L1 miss, the Downgrade rule may fire
immediately after the L2Resp but before any LoadHit or StoreHit rule fires. As a result, the L1Miss rule
needs to fire again, but if the Downgrade always happens right after the response comes back, the request
never completes, leading to livelock. We avoid this possibility by only allowing Downgrade to fire when
neither LoadHit nor StoreHit can fire.
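
The guard on Downgrade can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `L1Line`, `Request`, `is_miss` and `can_downgrade` are hypothetical, the head request is assumed to target the same address as the line, and the requirement that the line be non-busy and above the I state is our reading of the rule rather than a quotation of it. The miss predicate follows the definition of L1.miss given in the caption of Table 4.

```python
from dataclasses import dataclass
from typing import Optional

# State ordering I < S < M, matching conditions such as "L1.state < M".
I, S, M = 0, 1, 2


@dataclass
class Request:
    kind: str  # "load" or "store"
    pts: int   # timestamp carried by the processor request


@dataclass
class L1Line:
    state: int
    rts: int
    busy: bool = False


def is_miss(line: L1Line, req: Request) -> bool:
    """L1.miss as defined in the caption of Table 4."""
    if req.kind == "load":
        return line.state == I or (line.state == S and req.pts > line.rts)
    return line.state < M


def can_downgrade(line: L1Line, head: Optional[Request]) -> bool:
    """Assumed guard: Downgrade is enabled only if no hit rule can fire."""
    if line.busy or line.state == I:
        return False  # assumption: only a non-busy line above I can go down
    # With no pending request, or a pending miss, LoadHit/StoreHit cannot fire.
    return head is None or is_miss(line, head)
```

With this guard, a line that has just received its response cannot be downgraded while the request that caused the miss is still waiting to hit, which is the scenario described above.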

To rigorously prove livelock freedom, we need to guarantee that some transition rule should eventually 
make forward progress and no transition rule can make backward progress. We give the following definition 
of livelock freedom. 

Theorem 4. After any sequence of transitions, if there exists a pending request from any processor, then 
within a finite number of transitions, some request at some processor will dequeue. 

In order to prove the theorem, we will show that for every transition rule, at least one request will make 
forward progress and move one step towards the end of the request and at the same time no other request 
makes backward progress; or if no request makes forward or backward progress for the transition, we show 
that such transitions can only be fired a finite number of times. Specifically, we define forward progress 
as a lattice of system states where each request in mRq (load or store) has its own lattice. Table 4 shows
the lattice for a request where the lower parts in the lattice correspond to the states with more forward 
progress. We will prove livelock freedom by showing that for any state transition in any cache, any request 
either moves down the lattice (making forward progress) or stays at the current position but never moves 
up. Moreover, transitions which keep the state of every request staying at the same position in the lattice 
can only occur a finite number of times. Specifically, we will prove the following lemma. 

Table 4: Lattice for a request. For a load request, L1.miss = (L1.state = I ∨ (L1.state = S ∧ pts > L1.rts)).
For a store request, L1.miss = (L1.state < M). bufferName_exist means a message exists in the buffer and
bufferName_rdy means that the message is the head of the buffer. bufferName_rdy implies bufferName_exist.

Entry  Condition
1      L1.miss ∧ ¬L1.busy
2      L1.miss ∧ L1.busy ∧ c2pRq_exist ∧ ¬c2pRq_rdy
3      L1.miss ∧ L1.busy ∧ c2pRq_rdy ∧ L2.state = M ∧ ¬L2.busy
4      L1.miss ∧ L1.busy ∧ c2pRq_rdy ∧ L2.state = M ∧ L2.busy ∧ p2cRq_exist ∧ ¬p2cRq_rdy
5      L1.miss ∧ L1.busy ∧ c2pRq_rdy ∧ L2.state = M ∧ L2.busy ∧ p2cRq_rdy ∧ ownerL1.state = M
6      L1.miss ∧ L1.busy ∧ c2pRq_rdy ∧ L2.state = M ∧ L2.busy ∧ p2cRq_rdy ∧ ownerL1.state < M
7      L1.miss ∧ L1.busy ∧ c2pRq_rdy ∧ L2.state = M ∧ L2.busy ∧ ¬p2cRq_exist
8      L1.miss ∧ L1.busy ∧ c2pRq_rdy ∧ L2.state = S
9      L1.miss ∧ L1.busy ∧ p2cRp_exist ∧ ¬p2cRp_rdy
10     L1.miss ∧ L1.busy ∧ p2cRp_rdy
11     ¬L1.miss

Lemma 8. For a state transition except Downgrade, WriteBackReq and WriteBackResp, either a request
dequeues from the mRq or at least one request will move down its lattice. For all the state transitions, no
request will move up its lattice. Further, the system can only fire Downgrade, WriteBackReq and
WriteBackResp a finite number of times without firing other transitions.

We need the following lemmas before proving Lemma 8.

Lemma 9. If an L1 cacheline is busy, then exactly one request (GetS or GetM in c2pRq) or response (ToS
or ToM in p2c) exists for the address and the L1 cache. If the L1 cacheline is non-busy, then no request or
response can exist in its c2pRq and p2c.

Proof. This lemma is stronger than Lemma. We prove it by induction on the transition
sequence. For the base case, all the L1 cachelines are non-busy and no message exists, and thus the lemma
is true.

We only need to consider the cases where the busy flag changes or a request or response is enqueued
or dequeued. Only the L1Miss, L2Resp, ShReq_S and ExReq_S rules need to be considered.

For L1Miss, a request is enqueued to c2pRq and the L1 cacheline becomes busy. For L2Resp, a response
is dequeued and the L1 cacheline becomes non-busy. For ShReq_S and ExReq_S, a request is dequeued but a
response is enqueued. By the induction hypothesis, after the current transition, the hypothesis is still true
for all the cases above, proving the lemma. □
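
The induction above can be mirrored by a small randomized experiment. The Python sketch below is a toy model, not the protocol itself: it tracks a single L1 cacheline, its busy flag and the messages in flight for it, it collapses ShReq_S and ExReq_S into one hypothetical `L2Serve` step, and it lets L1Miss fire whenever the line is non-busy. After every step it checks the statement of Lemma 9. All class and function names are assumptions made for illustration.

```python
import random
from collections import deque


class LineModel:
    """Toy model of one L1 cacheline and the messages in flight for it."""

    def __init__(self):
        self.busy = False
        self.c2p_rq = deque()  # GetS / GetM requests travelling to the L2
        self.p2c = deque()     # ToS / ToM responses travelling to the L1

    def enabled(self):
        """Rules whose (simplified) guard currently holds."""
        rules = []
        if not self.busy:
            rules.append("L1Miss")
        if self.c2p_rq:
            rules.append("L2Serve")  # stands in for ShReq_S / ExReq_S
        if self.p2c:
            rules.append("L2Resp")
        return rules

    def fire(self, rule):
        if rule == "L1Miss":      # enqueue a request, line becomes busy
            self.c2p_rq.append("Get")
            self.busy = True
        elif rule == "L2Serve":   # dequeue the request, enqueue a response
            self.c2p_rq.popleft()
            self.p2c.append("To")
        elif rule == "L2Resp":    # dequeue the response, line becomes non-busy
            self.p2c.popleft()
            self.busy = False
        else:
            raise ValueError(rule)

    def invariant(self):
        """Lemma 9: busy means exactly one message, non-busy means none."""
        in_flight = len(self.c2p_rq) + len(self.p2c)
        return in_flight == (1 if self.busy else 0)


def random_walk(steps, seed=0):
    """Fire randomly chosen enabled rules; report whether Lemma 9 always held."""
    rng = random.Random(seed)
    line = LineModel()
    for _ in range(steps):
        line.fire(rng.choice(line.enabled()))
        if not line.invariant():
            return False
    return True
```

A call such as `random_walk(1000)` exercises the three cases of the proof in random order. It is a sanity check of the invariant on this simplified model, not a substitute for the induction.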

Lemma 10. If an L1 cacheline is busy, there must exist a request at the head of the mRq buffer for the
address and the request misses in the L1.

Proof. We prove this by induction on the transition sequence. For the base case, all L1 cachelines are
non-busy and the lemma is true.

We consider cases where the L1 cacheline is busy after the transition. Only L1Miss can make an L1
cacheline busy from non-busy, and the rule requires a request to be waiting at the head of the mRq buffer.
If the L1 cacheline stays busy, then no rule can remove the request from the mRq buffer. By the induction
hypothesis, the lemma is true after any transition. □


Lemma 11. If an L2 cacheline is busy, there must exist a request with the same address at the head of the 
c2pRq buffer in L2. 

Proof. The proof follows the same structure as the previous proof for Lemma 10. □

Lemma 12. For a memory request in a c2pRq buffer, its type and pts equal the type and pts of a pending
processor request to the same address at the head of the mRq at the L1 cache.

Proof. By Lemmas 9 and 10, the L1 cacheline with the same address must be busy and a pending processor
request exists at the head of the mRq buffer. Only the L1Miss rule sets the type and pts of a memory request
in a c2pRq buffer, and they equal the type and pts of the processor request. □

Lemma 13. For a memory response in a p2c buffer, its type equals the type of a pending processor request
to the same address at the L1 cache and, if the type = S, its rts is no less than the pts of that processor
request.

Proof. Similar to the proof of Lemma 12, the processor request must exist. Only the ShReq_S and ExReq_S
rules set the type and rts of the response; the type equals the type of the memory request and, if type = S,
rts is no less than the pts of the memory request. Then the lemma is true by Lemma 12. □


Lemma 14. When the L2Resp rule fires, a request with the same address at the head of mRq will transition
from an L1 miss to an L1 hit.

Proof. Before the transition of L2Resp, the L1 cacheline is busy, and a response is at the head of the p2c
buffer. By Lemma 13, if the pending processor request has type = M, then the memory response also has
type = M and thus it is an L1 hit. If the pending processor request has type = S, also by Lemma 13, the
memory response has type = S and the rts of the response is no less than the pts of the pending request.
Therefore, it is also an L1 hit. □


Lemma 15 (Coverage). The union of all the entries in Table 4 is True.

Proof. By Lemma 9, if L1.busy, then c2pRq_exist ∨ p2cRp_exist is True.

Then it becomes obvious that the union of all the entries is True. □


Lemma 16 (Mutually Exclusive). The intersection of any two entries in Table 4 is False.

Proof. For most pairs of entries, we can trivially check that the intersection is False. The only tricky cases
are the intersection of entry 9 or 10 with an entry from 3 to 8. These cases can be proven False using
Lemma 9, which implies that c2pRq_exist ∧ p2cRp_exist ⇒ False. □
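
Coverage and mutual exclusion can also be checked by brute force. The Python sketch below encodes the eleven entries of Table 4 as Boolean predicates, enumerates every assignment that respects bufferName_rdy ⇒ bufferName_exist and Lemma 9, and confirms that exactly one entry holds in each case. The variable names are ours, and we assume that L2.state is either S or M (no main memory yet) and that the remaining flags may vary independently.

```python
from itertools import product


def entries(l1_miss, l1_busy, rq_exist, rq_rdy, l2_state, l2_busy,
            wb_exist, wb_rdy, owner_m, rp_exist, rp_rdy):
    """Return the Table 4 entries (1..11) whose predicate holds.

    rq_* refers to c2pRq, wb_* to p2cRq (the WBRq), rp_* to p2cRp;
    owner_m means ownerL1.state = M (otherwise ownerL1.state < M).
    """
    base = l1_miss and l1_busy and rq_rdy and l2_state == "M"
    preds = [
        l1_miss and not l1_busy,                              # entry 1
        l1_miss and l1_busy and rq_exist and not rq_rdy,      # entry 2
        base and not l2_busy,                                 # entry 3
        base and l2_busy and wb_exist and not wb_rdy,         # entry 4
        base and l2_busy and wb_rdy and owner_m,              # entry 5
        base and l2_busy and wb_rdy and not owner_m,          # entry 6
        base and l2_busy and not wb_exist,                    # entry 7
        l1_miss and l1_busy and rq_rdy and l2_state == "S",   # entry 8
        l1_miss and l1_busy and rp_exist and not rp_rdy,      # entry 9
        l1_miss and l1_busy and rp_rdy,                       # entry 10
        not l1_miss,                                          # entry 11
    ]
    return [i + 1 for i, p in enumerate(preds) if p]


def consistent(l1_busy, rq_exist, rq_rdy, wb_exist, wb_rdy, rp_exist, rp_rdy):
    """rdy implies exist for every buffer, plus the statement of Lemma 9."""
    rdy_ok = ((not rq_rdy or rq_exist) and (not wb_rdy or wb_exist)
              and (not rp_rdy or rp_exist))
    if l1_busy:
        lemma9 = rq_exist != rp_exist          # exactly one message in flight
    else:
        lemma9 = not rq_exist and not rp_exist  # no message in flight
    return rdy_ok and lemma9


def check_table():
    """True if every consistent state satisfies exactly one entry."""
    for (miss, busy, rqe, rqr, l2b, wbe, wbr, own, rpe, rpr) in product(
            (False, True), repeat=10):
        if not consistent(busy, rqe, rqr, wbe, wbr, rpe, rpr):
            continue
        for l2_state in ("S", "M"):
            hit = entries(miss, busy, rqe, rqr, l2_state, l2b,
                          wbe, wbr, own, rpe, rpr)
            if len(hit) != 1:
                return False
    return True
```

Dropping the Lemma 9 constraint shows why it is needed: a state with both a ready request and a ready response would satisfy entries 8 and 10 at the same time.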

Proof of Lemma 8. We need to prove two goals. First, for each transition rule except Downgrade,
WriteBackReq and WriteBackResp, at least one request will dequeue or move down the lattice. Second, for
all transition rules, no request will move up the lattice.

We first prove that a transition with respect to address a1 never moves a request with address a2 (≠ a1)
up its lattice. The only possible way that the transition affects the request with a2 is by dequeuing from a
buffer, which may make a request with a2 the head of the buffer and thus ready. However, this can only
move the request with a2 down the lattice.

Also note that each processor can only serve one request per address at a time, because the mRq is a
FIFO. Therefore, for the second goal we only need to prove that requests with the same address in other
processors do not move up the lattice. We prove both goals for each transition rule.

For LoadHit and StoreHit, a request always dequeues from the mRq and the lemma is satisfied.

For the L1Miss rule, a request must exist and be in entry 1 in Table 4 before the transition. Since
busy = True after the transition, it must move down the lattice to one of the entries from 2 to 10. Since the
L1 cacheline state does not change, no other requests in other processors move in their lattice.

For the L2Resp rule, according to Lemma 14, a request will move from L1.miss to L1.hit. In the lattice,
this corresponds to moving from entry 10 to entry 11, which is a forward movement. For another request to
the same address, the only entries that might be affected are entries 4, 5 and 6. However, since p2c is a
FIFO and the response is ready in the p2c buffer before the transition, no WBRq can be ready in this p2c
buffer for other requests with the same address, and thus they cannot be in entry 5 or 6. If another request
is in entry 4, the transition removes the response from the p2c and this may make the WBRq ready in p2c,
and thus the request moves down the lattice. In all cases, no other requests move up the lattice.

For the ShReq_S or ExReq_S rule to fire, there must exist a request in the c2pRq buffer, which means the
address must be busy in the corresponding L1 (Lemma 9) and thus a request exists in its mRq and misses
the L1 (Lemma 10). This request, therefore, must be in entry 8 in Table 4. The transition will dequeue the
request and enqueue a response to p2c, and thus moves the request down to entry 9 or 10. All the other
requests with the same address cannot be ready in the c2pRq buffer since the current request blocks
them, and thus they are not in entries 3 to 8 in the lattice. For the other entries, they can only possibly be
affected by the transition if the current request is dequeued and one of them becomes ready. This, however,
only moves the request down the lattice.

Table 5: System Components required for main memory.

Component  Format                              Message Types
MemRq      MemRq.entry = (type, addr, data)    MemRq
MemRp      MemRp.entry = (addr, data)          MemRp
Mem        Mem[addr] = (data)                  -
mts        -                                   -


The Req_M rule can only fire if a request is ready in c2pRq and the L2 is in the M state. According to
Lemma 9 and Lemma 10, there exists a request in one mRq that is in entry 3 in Table 4. After the transition,
this request will move to entry 4, 5 or 6 and thus down the lattice. All the other requests, similar to
the discussion of ShReq_S and ExReq_S, either stay in the same entry or move down the lattice.

Finally, we discuss the Downgrade, WriteBackReq and WriteBackResp rules. The Downgrade rule can
only fire when the L1 cacheline is non-busy, corresponding to entries 1 and 11 if the request is from the same
L1 as the cacheline being downgraded. A request in entry 1 cannot move up since it is the first entry. If a
request is in entry 11, since it is an L1 hit now, the Downgrade rule does not fire. For a request from a
different L1, the Downgrade rule may affect entries 5 and 6. However, it can only move the request from
entry 5 to 6 rather than in the opposite direction.

For the WriteBackReq rule, if the L1 cacheline is in the S state, then nothing changes but a message is
dequeued from the p2c buffer, which can only move other requests down the lattice. If the L1 cacheline is in
the M state and a request to the same address exists in the current L1, the request must be a hit and thus
WriteBackReq cannot fire. Requests from other L1s can only be affected if they are in entry 4.
Then, the current transition can only move them down the lattice.

For the WriteBackResp rule, the L2 cacheline moves from the M to the S state. All the other requests
can only move down their lattice due to this transition.

Finally, we prove that Downgrade, WriteBackReq and WriteBackResp can only fire a finite number of
times without other transitions being fired. Each time Downgrade is fired, an L1 cacheline's state goes down.
Since there are only a finite number of L1 cachelines and a finite number of states, Downgrade can only be
fired a finite number of times. Similarly, each WriteBackReq transition consumes a WBRq message, which
can only be replenished by the Req_M rule. And each WriteBackResp transition consumes a WBRp, which
is only replenished by Downgrade and WriteBackReq and thus only has a finite count. □

Proof of Theorem 4. If there exists a pending request from any processor, by Lemma 8, some pending
request will eventually dequeue or move down the lattice, which only has a finite number of states. For a
finite number of processors, since the mRq is a FIFO, only a finite number of pending requests can exist.
Therefore, some pending request will eventually reach the end of the lattice and dequeue, proving the
theorem. □


5 Main Memory 

For ease of discussion, we have assumed that all the data fit in the L2 cache, which is not realistic for some
shared memory systems. A multicore processor, for example, has an off-chip main memory which does not
contain timestamps. For these systems, the components in Table 5 and the transition rules in Table 6 need
to be added. For the initial system state, all the data should be in main memory with all L2 cachelines in
the I state and mts = 0.

Most of the extra components and rules handle main memory requests and responses and the I
state in the L2. However, mts is a special timestamp added to represent the largest rts of the cachelines
stored in the main memory. The mts guarantees that cachelines loaded from the main memory have proper
timestamps and thus can be properly ordered.
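
The role of mts can be illustrated with a short Python sketch of the L2Evict and MemResp actions from Table 6. It is a simplification under stated assumptions: the MemRq and MemRp queues are collapsed into direct method calls, the busy flag and owner field are omitted, and the names `L2Line` and `MemorySide` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class L2Line:
    state: str = "I"  # "I", "S" or "M"
    data: int = 0
    wts: int = 0
    rts: int = 0


class MemorySide:
    """Main memory without per-line timestamps, plus the single mts value."""

    def __init__(self):
        self.mem = {}  # addr -> data
        self.mts = 0   # initial system state: mts = 0

    def l2_evict(self, addr, line):
        """L2Evict: write the line back and fold its rts into mts."""
        if line.state != "S":
            raise ValueError("L2Evict requires state = S")
        self.mem[addr] = line.data
        self.mts = max(line.rts, self.mts)
        line.state = "I"

    def mem_resp(self, addr, line):
        """MemResp: reload the line with wts = rts = mts."""
        if line.state != "I":
            raise ValueError("this sketch only reloads lines in state I")
        line.data = self.mem[addr]
        line.state = "S"
        line.wts = self.mts
        line.rts = self.mts
```

For example, after evicting one line with rts = 40 and another with rts = 15, mts is 40, so reloading the second line gives it wts = rts = 40. Its new timestamps are therefore no smaller than any lease that was granted on it before the eviction, which is the ordering property the text describes.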

Due to limited space, we only prove that the Tardis protocol with main memory still obeys sequential
consistency (SC). The system can also be shown to be deadlock- and livelock-free using proofs similar to
those in Section 4. For the SC proof, we only need to show that the lemmas used in the original SC proof
remain true after the main memory is added.

In order to prove these lemmas, we need the following simple lemma. 

Table 6: State Transition Rules for Main Memory.

Rules and Condition                                         Action
L2Miss                                                      MemRq.enq(S, addr, _)
  let (id, type, addr, pts) = c2pRq.get_msg()               busy := True
  condition: L2.[addr].state = I

MemResp                                                     state := S
  let (addr, data) = MemRp                                  l2data := data
  let (state, l2data, busy, owner, wts, rts) = L2.[addr]    busy := False
                                                            wts := mts
                                                            rts := mts

L2Downgrade                                                 p2c.enq(owner, Req, addr, _)
  let (state, data, busy, owner, wts, rts) = L2.[addr]      busy := True
  condition: state = M ∧ busy = False

L2Evict                                                     MemRq.enq(M, addr, data)
  let (state, data, busy, owner, wts, rts) = L2.[addr]      state := I
  condition: state = S                                      mts := max(rts, mts)

Lemma 17. If an L2 cacheline is in the I state, no clean block exists for the address. 

Proof. We prove by induction on the transition sequence. The hypothesis is true for the base case since no 
clean block exists. If an L2 cacheline moves from S to I (through the L2Evict rule), the clean block (L2 
cacheline in S state) is removed and no clean block exists for that address. By the transition rules, while the 
L2 line stays in the I state, no clean block can be created. By the induction hypothesis, if an L2 cacheline 
is in the I state after a transition, then no clean block can exist for that address. □ 

For Lemma we only need to include Lemma 17 in the original proof. For Lemmas and we need to
show the following properties of mts.

Lemma 18. If an L2 cacheline is in the I state, then the following statements are true. 

• mts is no less than the rts of all the copies of the block. 

• No store has happened to the address at ts such that ts > mts. 

• The data value of the cacheline in main memory comes from a store St which happened before the 
current physical time. And no other store St' has happened such that St.ts < St'.ts < mts. 

Proof. All the statements can be easily proven by induction on the transition sequence. For the S → I
transition of an L2 cacheline, since the resulting mts is no less than the rts, by Lemmas and all three
statements are true after the transition.

Consider the other case where the L2 cacheline stays in the I state. Since no clean block exists (Lemma 17),
the copies of the cacheline cannot change their timestamps and no store can happen. By the transition rules,
the mts never decreases after the transition. So the hypothesis must be true after the transition. □


To finish the original proof, we need to consider the final case where the L2 cacheline moves from the I to
the S state (MemResp rule). In this case, both wts and rts are set to mts. By Lemma 18, both Lemma and
Lemma 3 are true after the current transition.


6 Related Work 


Snoopy and directory [7, 35] cache coherence protocols are both popular and are widely adopted
in multicore processors [10], multi-socket systems [3, 40] and distributed shared memory systems [19, 24].
The Tardis protocol is a different yet equally powerful coherence protocol. Tardis is conceptually simpler
than a directory protocol and has excellent scalability. Some other timestamp-based coherence protocols
have also been proposed in the literature [13, 33], but none of them is as simple and high-performing
as Tardis.

Both model checking and formal proofs are popular in proving the correctness of cache coherence
protocols. Model-checking-based verification is a commonly used technique, but even with several
optimizations, it does not scale to automatically verify real-world systems.


Many other works prove the correctness of a cache coherence protocol by proving invariants,
as we did in this paper. Our invariants are in general simpler than theirs, partly because Tardis is
simpler than a directory coherence protocol. Finally, our proofs can be machine-checked along the lines of
the proofs for a hierarchical cache coherence protocol.


7 Conclusion

We provided simple, yet rigorous proofs of correctness for a recently-proposed scalable cache coherence 
protocol. Future work includes generalizing the protocol to relaxed memory consistency models and proving 
correctness, and machine-checking the proofs. 